Movie Rating Model and Predictor
Part 5: Modeling
At this stage, several linear regression models, based upon forward selection and backward elimination methods, were developed and evaluated to predict the popularity of a film. Popularity was defined in terms of box office success. Statistical analysis showed that the log of the number of IMDB votes was the best predictor of box office success; as such, it was used as the response variable. The models and their model selection methods are listed in Table 1.

Table 1: Prediction Models

| Model | Model.Selection | Data |
|---|---|---|
| Alpha | Forward Selection | Full model |
| Beta | Forward Selection | Full model with univariate outliers removed |
| Gamma | Backward Elimination | Full model |
| Delta | Backward Elimination | Full model, influential outliers removed |
Model Selection
Two model selection techniques were used: forward selection and backward elimination. The forward selection approach optimized adjusted R-squared, whereas the backward elimination method was based upon p-values.
Forward Selection
The forward selection process began with a null model. Each candidate variable was then added to the model, one at a time, and the variable that provided the greatest improvement over the current best adjusted R-squared was selected. The process was repeated with the variables not already in the model until all variables had been evaluated. Only additions that improved adjusted R-squared were retained at each step.
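The forward selection procedure can be sketched in Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the helper names (`solve`, `adjusted_r2`, `forward_select`) and the toy data are hypothetical, and all predictors are treated as numeric columns, so a categorical variable such as genre would first have to be expanded into indicator columns.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def adjusted_r2(columns, y):
    """Adjusted R-squared of a least-squares fit of y on columns plus an intercept."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k + 1)] for a in range(k + 1)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k + 1)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


def forward_select(data, y):
    """Greedily add the predictor that most improves adjusted R-squared."""
    selected, best = [], 0.0  # the null (intercept-only) model has adjusted R-squared 0
    remaining = set(data)
    while remaining:
        scores = {name: adjusted_r2([data[s] for s in selected + [name]], y)
                  for name in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:  # no remaining variable improves the model
            break
        selected.append(candidate)
        best = scores[candidate]
        remaining.remove(candidate)
    return selected, best


# Toy data (hypothetical): y depends strongly on x1, weakly on x2, not on x3.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "x3": [4, 4, 1, 1, 6, 6, 3, 3, 5, 5],
}
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, -0.05, 0.1]
y = [3 * a + 2 * b + e for a, b, e in zip(data["x1"], data["x2"], noise)]

selected, best = forward_select(data, y)
print(selected, round(best, 4))
```

Stopping as soon as no candidate improves adjusted R-squared gives the same final model as evaluating every remaining variable and discarding those that do not help, because adjusted R-squared is the only retention criterion.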
Backward Elimination
The backward elimination approach began with the full model. A regression analysis was performed and the least significant predictor (that with the highest p-value) was removed from the model. This process was repeated, removing only the least significant predictor at each step, until all predictors had p-values below the preset threshold.
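The loop behind backward elimination can be sketched as follows. This is an illustration only: `backward_eliminate` and `toy_pvalues` are hypothetical names, the model fitting is abstracted into a caller-supplied function, the 0.05 threshold is an assumption (the text does not state the preset threshold), and the p-values are made up for the demonstration.

```python
def backward_eliminate(predictors, fit_pvalues, threshold=0.05):
    """Drop the least significant predictor until every p-value is below threshold.

    fit_pvalues is a caller-supplied function that refits the model on the
    given predictors and returns a dict {predictor: p_value}.
    """
    current = list(predictors)
    removed = []
    while current:
        pvals = fit_pvalues(current)
        worst = max(current, key=lambda name: pvals[name])
        if pvals[worst] < threshold:  # every remaining predictor is significant
            break
        current.remove(worst)  # remove only one predictor per step, then refit
        removed.append((worst, pvals[worst]))
    return current, removed


def toy_pvalues(predictors):
    """Stand-in for a real refit: returns fixed, made-up p-values."""
    table = {"genre": 0.001, "critics_score": 0.0001, "best_actress_win": 0.53}
    return {name: table[name] for name in predictors}


kept, removed = backward_eliminate(
    ["genre", "critics_score", "best_actress_win"], toy_pvalues
)
print(kept, removed)
```

In a real analysis the p-values change after each removal, which is why the model is refit inside the loop rather than once up front.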
Full Model Selection
In the prior section, association and correlation tests were conducted with a 95% confidence level. All categorical variables were significant (Table 2) and were included in the full model. The decision to remove a quantitative variable was based largely upon its correlation with the response variable (Table 3) and its correlations with other quantitative variables (Table 4). Table 5 lists the variables removed and the rationale.
Table 5: Variables excluded from full model

| Type | Variable | Description | Rationale |
|---|---|---|---|
| Categorical | actor1 | First main actor/actress in the abridged cast of the movie | Not predictive without other data |
| Categorical | actor2 | Second main actor/actress in the abridged cast of the movie | Not predictive without other data |
| Categorical | actor3 | Third main actor/actress in the abridged cast of the movie | Not predictive without other data |
| Categorical | actor4 | Fourth main actor/actress in the abridged cast of the movie | Not predictive without other data |
| Categorical | actor5 | Fifth main actor/actress in the abridged cast of the movie | Not predictive without other data |
| Categorical | director | Director of the movie | Not predictive without other data |
| Categorical | dvd_rel_day | Day of the month the movie is released on DVD | No predictive value |
| Categorical | dvd_rel_month | Month the movie is released on DVD | No predictive value |
| Categorical | dvd_rel_year | Year the movie is released on DVD | No predictive value |
| Categorical | imdb_url | Link to IMDB page for the movie | No predictive value |
| Categorical | rt_url | Link to Rotten Tomatoes page for the movie | No predictive value |
| Categorical | thtr_rel_date | Date the movie is released in theaters | No predictive value |
| Categorical | thtr_rel_day | Day of the month the movie is released in theaters | No predictive value |
| Categorical | thtr_rel_year | Year the movie is released in theaters | No predictive value |
| Categorical | title | Title of movie | No predictive value |
| Categorical | title_type | Type of movie (Documentary, Feature Film, TV Movie) | Highly correlated with genre |
| Numeric | daily_box_office | Daily box office revenue from BoxOfficeMojo.com | Not in data set |
| Numeric | imdb_num_votes | Number of votes on IMDB | Highly correlated with imdb_num_votes_log |
| Numeric | votes_per_day | The number of IMDB Votes / thtr_days | Highly correlated with votes_per_day_scores_log |
| Numeric | votes_per_day_log | Log of votes_per_day | Highly correlated with votes_per_day_scores_log |

Table 6: Variables included in full model

| Type | Variable | Description |
|---|---|---|
| Categorical | audience_rating | Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright) |
| Categorical | best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| Categorical | best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie |
| Categorical | best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| Categorical | best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| Categorical | best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| Categorical | critics_rating | Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) |
| Categorical | genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| Categorical | mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| Categorical | studio | The studio that produced the film |
| Categorical | thtr_rel_month | Month the movie is released in theaters |
| Categorical | thtr_rel_season | Season the movie was released in theaters |
| Categorical | top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) |
| Numeric | audience_score | Audience score on Rotten Tomatoes |
| Numeric | cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| Numeric | cast_experience_log | Log of the sum across all cast members for a film, of the number of films in which each actor appeared |
| Numeric | cast_votes | Total number of allocated IMDB votes per day for the cast of a film |
| Numeric | cast_votes_log | Log of cast_votes |
| Numeric | critics_score | Critics score on Rotten Tomatoes |
| Numeric | daily_box_office_log | Log of Box office revenue from BoxOfficeMojo.com |
| Numeric | director_experience | Total number of films in sample for a director |
| Numeric | director_experience_log | Log of the total number of films directed by the film’s director |
| Numeric | imdb_num_votes_log | Log number of IMDB votes |
| Numeric | imdb_rating | Rating on IMDB |
| Numeric | runtime | Runtime of movie (in minutes) |
| Numeric | runtime_log | Log runtime of movie (in minutes) |
| Numeric | thtr_days | Number of days from theatre release date to January 1, 2016 |
| Numeric | thtr_days_log | Log of thtr_days |
Model Alpha
For this model, a forward selection procedure was undertaken based upon the full model described above. The variables were added as described in Table 7.
Table 7: Model Alpha forward selection process

| Step | Selected | Model.Size | DF | F.statistic | R.Squared | Adjusted.R2 | p.value | Pct Chg |
|---|---|---|---|---|---|---|---|---|
| 1 | cast_votes_log | 1 | 2 489 | 536.66 | 0.52 | 0.52 | 0 | 0.00 |
| 2 | genre | 2 | 12 479 | 56.46 | 0.56 | 0.56 | 0 | 6.32 |
| 3 | critics_score | 3 | 13 478 | 59.88 | 0.60 | 0.59 | 0 | 6.31 |
| 4 | best_pic_win | 4 | 14 477 | 57.33 | 0.61 | 0.60 | 0 | 1.53 |
| 5 | cast_experience_log | 5 | 15 476 | 55.37 | 0.62 | 0.61 | 0 | 1.50 |
| 6 | runtime_log | 6 | 16 475 | 53.12 | 0.63 | 0.62 | 0 | 1.15 |
| 7 | director_experience_log | 7 | 17 474 | 50.60 | 0.63 | 0.62 | 0 | 0.49 |
| 8 | thtr_rel_month | 8 | 28 463 | 30.56 | 0.64 | 0.62 | 0 | 0.32 |
| 9 | best_pic_nom | 9 | 29 462 | 29.62 | 0.64 | 0.62 | 0 | 0.16 |
Model Overview
This model is defined as follows: \[y_i = (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7} + \beta_8 x_{i8} + \beta_9 x_{i9}) + \epsilon_i\]
where:
\(y_i\) is the log number of votes for movie \(i\)
\(x_{i1}\) is log of cast_votes for movie \(i\)
\(x_{i2}\) is genre of movie (action & adventure, comedy, documentary, drama, horror, mystery & suspense, other) for movie \(i\)
\(x_{i3}\) is critics score on Rotten Tomatoes for movie \(i\)
\(x_{i4}\) is whether or not the movie won a best picture Oscar (no, yes) for movie \(i\)
\(x_{i5}\) is log of the sum across all cast members for a film, of the number of films in which each actor appeared for movie \(i\)
\(x_{i6}\) is log runtime of movie (in minutes) for movie \(i\)
\(x_{i7}\) is log of the total number of films directed by the film’s director for movie \(i\)
\(x_{i8}\) is month the movie is released in theaters for movie \(i\)
\(x_{i9}\) is whether or not the movie was nominated for a best picture Oscar (no, yes) for movie \(i\)
\(\epsilon_i\) is the residual for movie \(i\)
As suggested by Figure 1, the model was significant (F(28, 462) = 29.625, p < .001), with an adjusted R-squared of 0.621.
Figure 1 Model Alpha Regression
Analysis of Variance
Figure 2 summarizes the analysis of variance.

| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| cast_votes_log | 1 | 1399.493 | 1399.493 | 675.751 | 0.000 | 52.32 |
| genre | 10 | 110.584 | 11.058 | 5.340 | 0.000 | 4.13 |
| critics_score | 1 | 96.140 | 96.140 | 46.422 | 0.000 | 3.59 |
| best_pic_win | 1 | 24.689 | 24.689 | 11.921 | 0.001 | 0.92 |
| cast_experience_log | 1 | 26.277 | 26.277 | 12.688 | 0.000 | 0.98 |
| runtime_log | 1 | 18.511 | 18.511 | 8.938 | 0.003 | 0.69 |
| director_experience_log | 1 | 11.255 | 11.255 | 5.435 | 0.020 | 0.42 |
| thtr_rel_month | 11 | 26.295 | 2.390 | 1.154 | 0.317 | 0.98 |
| best_pic_nom | 1 | 4.644 | 4.644 | 2.242 | 0.135 | 0.17 |
| Residuals | 462 | 956.811 | 2.071 | NA | NA | 35.77 |
Figure 2 Model Alpha analysis of variance
An analysis of variance was conducted on the contribution of the 9 predictors to the log number of IMDB votes. The effect of cast_votes_log was significant, F(1, 462) = 675.751, p < .001, accounting for 52.32% of the variance. The effect of genre was significant, F(10, 462) = 5.34, p < .001, accounting for 4.13% of the variance. The effect of critics_score was significant, F(1, 462) = 46.422, p < .001, accounting for 3.59% of the variance. The effect of best_pic_win was significant, F(1, 462) = 11.921, p < .001, accounting for 0.92% of the variance. The effect of cast_experience_log was significant, F(1, 462) = 12.688, p < .001, accounting for 0.98% of the variance. The effect of runtime_log was significant, F(1, 462) = 8.938, p < .01, accounting for 0.69% of the variance. The effect of director_experience_log was significant, F(1, 462) = 5.435, p < .05, accounting for 0.42% of the variance. The effect of thtr_rel_month was not significant, F(11, 462) = 1.154, p = .317, accounting for 0.98% of the variance. The effect of best_pic_nom was not significant, F(1, 462) = 2.242, p = .135, accounting for 0.17% of the variance. Finally, the residuals accounted for approximately 35.77% of the variance.
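The summary statistics reported for Model Alpha can be recomputed from the analysis of variance table. The sketch below copies the degrees of freedom and sums of squares from Figure 2 and derives the overall F statistic, R-squared, adjusted R-squared, and the percentage of variance per term; the variable names are illustrative, and small differences from the reported values are due to rounding in the table.

```python
# (term, degrees of freedom, sum of squares) copied from Figure 2
anova = [
    ("cast_votes_log", 1, 1399.493),
    ("genre", 10, 110.584),
    ("critics_score", 1, 96.140),
    ("best_pic_win", 1, 24.689),
    ("cast_experience_log", 1, 26.277),
    ("runtime_log", 1, 18.511),
    ("director_experience_log", 1, 11.255),
    ("thtr_rel_month", 11, 26.295),
    ("best_pic_nom", 1, 4.644),
]
resid_df, resid_ss = 462, 956.811

model_df = sum(df for _, df, _ in anova)   # numerator degrees of freedom
model_ss = sum(ss for _, _, ss in anova)   # variation explained by the model
total_df = model_df + resid_df
total_ss = model_ss + resid_ss

mse = resid_ss / resid_df                  # residual mean square
f_stat = (model_ss / model_df) / mse       # overall F statistic
r2 = model_ss / total_ss
adj_r2 = 1 - mse / (total_ss / total_df)   # penalizes the number of parameters
pct_var = {term: 100 * ss / total_ss for term, _, ss in anova}

print(f"F({model_df}, {resid_df}) = {f_stat:.3f}")
print(f"R-squared = {r2:.3f}, adjusted R-squared = {adj_r2:.3f}")
print(f"cast_votes_log explains {pct_var['cast_votes_log']:.2f}% of the variance")
```

The term degrees of freedom sum to 28, and the resulting F statistic and adjusted R-squared agree with the reported 29.625 and 0.621.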
Model Coefficients
Table 8 Model Alpha Coefficients

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9.932 | 2.181 | 4.553 | 0.000 |
| cast_votes_log | 0.589 | 0.029 | 20.021 | 0.000 |
| genreAnimation | -1.422 | 0.644 | -2.210 | 0.028 |
| genreArt House & International | -2.225 | 0.489 | -4.551 | 0.000 |
| genreComedy | -0.976 | 0.283 | -3.453 | 0.001 |
| genreDocumentary | -2.633 | 0.368 | -7.156 | 0.000 |
| genreDrama | -1.508 | 0.243 | -6.218 | 0.000 |
| genreHorror | -0.583 | 0.418 | -1.395 | 0.164 |
| genreMusical & Performing Arts | -1.929 | 0.539 | -3.579 | 0.000 |
| genreMystery & Suspense | -1.039 | 0.318 | -3.271 | 0.001 |
| genreOther | -1.564 | 0.465 | -3.366 | 0.001 |
| genreScience Fiction & Fantasy | -0.231 | 0.559 | -0.413 | 0.680 |
| critics_score | 0.014 | 0.003 | 5.043 | 0.000 |
| best_pic_winyes | 1.128 | 0.665 | 1.696 | 0.091 |
| cast_experience_log | -0.841 | 0.207 | -4.067 | 0.000 |
| runtime_log | 0.676 | 0.330 | 2.049 | 0.041 |
| director_experience_log | 0.289 | 0.117 | 2.460 | 0.014 |
| thtr_rel_monthFeb | -0.038 | 0.371 | -0.103 | 0.918 |
| thtr_rel_monthMar | 0.320 | 0.323 | 0.991 | 0.322 |
| thtr_rel_monthApr | -0.043 | 0.324 | -0.132 | 0.895 |
| thtr_rel_monthMay | -0.033 | 0.328 | -0.101 | 0.919 |
| thtr_rel_monthJun | 0.233 | 0.289 | 0.808 | 0.419 |
| thtr_rel_monthJul | 0.712 | 0.326 | 2.186 | 0.029 |
| thtr_rel_monthAug | 0.564 | 0.342 | 1.650 | 0.100 |
| thtr_rel_monthSep | -0.024 | 0.314 | -0.076 | 0.939 |
| thtr_rel_monthOct | 0.124 | 0.294 | 0.422 | 0.673 |
| thtr_rel_monthNov | 0.311 | 0.326 | 0.954 | 0.341 |
| thtr_rel_monthDec | 0.394 | 0.306 | 1.289 | 0.198 |
| best_pic_nomyes | 0.644 | 0.430 | 1.497 | 0.135 |
As shown in Table 8, the predicted log of the number of IMDB votes for a film was 9.932, plus 0.589 log votes for each unit of cast_votes_log, plus a genre factor associated with the genre of the film, plus 0.014 log votes for each point of the critics score, plus 1.128 log votes if the film won a best picture Oscar, minus 0.841 log votes for each unit of the log of the number of films in which the cast appeared, plus 0.676 log votes for each unit of the log of movie runtime, plus 0.289 log votes for each unit of the log of the number of films directed by the director of the film. An additional 0.644 log votes are added if the film was nominated for a best picture Oscar. Further log votes are added (or subtracted) based upon the month in which the film was released. The genres, in order of contribution to the estimated log number of IMDB votes, are:
Table 9 Model Alpha Genres and Popularity

| Estimate | Std. Error | t value | Pr(>\|t\|) | coef |
|---|---|---|---|---|
| -0.231 | 0.559 | -0.413 | 0.680 | genreScience Fiction & Fantasy |
| -0.583 | 0.418 | -1.395 | 0.164 | genreHorror |
| -0.976 | 0.283 | -3.453 | 0.001 | genreComedy |
| -1.039 | 0.318 | -3.271 | 0.001 | genreMystery & Suspense |
| -1.422 | 0.644 | -2.210 | 0.028 | genreAnimation |
| -1.508 | 0.243 | -6.218 | 0.000 | genreDrama |
| -1.564 | 0.465 | -3.366 | 0.001 | genreOther |
| -1.929 | 0.539 | -3.579 | 0.000 | genreMusical & Performing Arts |
| -2.225 | 0.489 | -4.551 | 0.000 | genreArt House & International |
| -2.633 | 0.368 | -7.156 | 0.000 | genreDocumentary |
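Every genre estimate is negative relative to the reference genre (Action & Adventure, the level that does not appear in the table). According to this model, Action & Adventure films are therefore the most popular, followed by Science Fiction & Fantasy, Horror, and Comedy films; the estimates for Science Fiction & Fantasy and Horror were not significantly different from the reference genre.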
Table 10 Model Alpha Timing and Popularity

| Estimate | Std. Error | t value | Pr(>\|t\|) | coef |
|---|---|---|---|---|
| 0.712 | 0.326 | 2.186 | 0.029 | thtr_rel_monthJul |
| 0.564 | 0.342 | 1.650 | 0.100 | thtr_rel_monthAug |
| 0.394 | 0.306 | 1.289 | 0.198 | thtr_rel_monthDec |
| 0.320 | 0.323 | 0.991 | 0.322 | thtr_rel_monthMar |
| 0.311 | 0.326 | 0.954 | 0.341 | thtr_rel_monthNov |
| 0.233 | 0.289 | 0.808 | 0.419 | thtr_rel_monthJun |
| 0.124 | 0.294 | 0.422 | 0.673 | thtr_rel_monthOct |
| -0.024 | 0.314 | -0.076 | 0.939 | thtr_rel_monthSep |
| -0.033 | 0.328 | -0.101 | 0.919 | thtr_rel_monthMay |
| -0.038 | 0.371 | -0.103 | 0.918 | thtr_rel_monthFeb |
| -0.043 | 0.324 | -0.132 | 0.895 | thtr_rel_monthApr |
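As indicated in Table 10, July and August had the largest estimated effects, followed by December, March, and November, relative to January (the month that does not appear in the table). Only the July estimate was statistically significant at the .05 level.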
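To make the coefficient table concrete, the sketch below applies the Model Alpha estimates from Table 8 to a single hypothetical film. Several things here are assumptions rather than facts from the analysis: the function name `predict_log_votes`, the input values, the use of natural logarithms, and the treatment of Action & Adventure and January as the reference levels (inferred from their absence in Table 8).

```python
import math

# Estimates copied from Table 8 (Model Alpha)
COEF = {
    "intercept": 9.932,
    "cast_votes_log": 0.589,
    "critics_score": 0.014,
    "best_pic_win": 1.128,
    "cast_experience_log": -0.841,
    "runtime_log": 0.676,
    "director_experience_log": 0.289,
    "best_pic_nom": 0.644,
}
GENRE = {
    "Action & Adventure": 0.0,  # assumed reference level
    "Animation": -1.422, "Art House & International": -2.225,
    "Comedy": -0.976, "Documentary": -2.633, "Drama": -1.508,
    "Horror": -0.583, "Musical & Performing Arts": -1.929,
    "Mystery & Suspense": -1.039, "Other": -1.564,
    "Science Fiction & Fantasy": -0.231,
}
MONTH = {
    "Jan": 0.0,  # assumed reference level
    "Feb": -0.038, "Mar": 0.320, "Apr": -0.043, "May": -0.033,
    "Jun": 0.233, "Jul": 0.712, "Aug": 0.564, "Sep": -0.024,
    "Oct": 0.124, "Nov": 0.311, "Dec": 0.394,
}


def predict_log_votes(cast_votes_log, genre, critics_score, best_pic_win,
                      cast_experience_log, runtime_minutes,
                      director_experience_log, month, best_pic_nom):
    """Predicted log number of IMDB votes under Model Alpha."""
    return (
        COEF["intercept"]
        + COEF["cast_votes_log"] * cast_votes_log
        + GENRE[genre]
        + COEF["critics_score"] * critics_score
        + COEF["best_pic_win"] * (1 if best_pic_win else 0)
        + COEF["cast_experience_log"] * cast_experience_log
        + COEF["runtime_log"] * math.log(runtime_minutes)  # natural log assumed
        + COEF["director_experience_log"] * director_experience_log
        + MONTH[month]
        + COEF["best_pic_nom"] * (1 if best_pic_nom else 0)
    )


# Hypothetical film: a 120-minute drama released in July, no Oscar history
pred = predict_log_votes(
    cast_votes_log=5.0, genre="Drama", critics_score=80, best_pic_win=False,
    cast_experience_log=3.0, runtime_minutes=120,
    director_experience_log=1.0, month="Jul", best_pic_nom=False,
)
print(f"predicted log votes: {pred:.3f}")
```

Exponentiating the result would give a prediction on the scale of raw vote counts, again assuming natural logarithms were used for the response.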
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 3.
Figure 3 Model Alpha linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(29), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 4) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 4 Model Alpha homoscedasticity plot
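The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .01). Because a significant Breusch–Pagan result is evidence against constant variance, the test itself did not support homoscedasticity, and the assumption rested on the visual evidence of the residuals plot.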
Residuals
The histogram and the normal Q-Q plot in Figure 5 illustrate the distribution of residuals.
Figure 5 Model Alpha residuals plot
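The histogram and normal Q-Q plot suggested an approximately normal distribution of residuals. The Shapiro-Wilk test was significant (SW = 0.986, p < .001), which formally indicates a departure from normality; however, the skewness (-0.463) and kurtosis (3.194) were close to the values of 0 and 3 expected under normality, so the assumption of normality was considered reasonable.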
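For reference, moment-based skewness and kurtosis can be computed as in the sketch below. It assumes the population (divide-by-n) moment definitions and the non-excess kurtosis scale on which a normal distribution scores 3; the report does not state which estimator was used, and the function name and sample values are hypothetical.

```python
def skewness_kurtosis(values):
    """Return (skewness, kurtosis) from central moments.

    Kurtosis is on the non-excess scale: a normal distribution gives 3.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# A small symmetric sample: skewness is zero, kurtosis is below 3 (flat-topped)
skew, kurt = skewness_kurtosis([-2, -1, 0, 1, 2])
print(round(skew, 3), round(kurt, 3))
```

Negative skewness, as reported for the Model Alpha residuals, indicates a longer left tail; kurtosis slightly above 3 indicates tails marginally heavier than normal.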
Multicollinearity
As shown in Figure 6 and Table 11, collinearity did not appear to be present for this model. Variance inflation factors were computed for each predictor in the model. The maximum GVIF of 2.139 (genre) did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 6: Model Alpha correlations among quantitative predictors
Table 11: Model Alpha variance inflation factors

| Term | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| cast_votes_log | 1.612 | 1 | 1.270 |
| genre | 2.139 | 10 | 1.039 |
| critics_score | 1.365 | 1 | 1.168 |
| best_pic_win | 1.475 | 1 | 1.215 |
| cast_experience_log | 1.632 | 1 | 1.278 |
| runtime_log | 1.427 | 1 | 1.195 |
| director_experience_log | 1.160 | 1 | 1.077 |
| thtr_rel_month | 1.532 | 11 | 1.020 |
| best_pic_nom | 1.549 | 1 | 1.245 |
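The last column of the table is the generalized VIF adjusted for the number of coefficients a term contributes, which makes a multi-level factor such as genre comparable with a single numeric predictor. The sketch below recomputes it from the GVIF and Df columns; the function name is illustrative, and agreement is only to within rounding because the tabulated GVIF values are themselves rounded.

```python
# term: (GVIF, Df, reported GVIF^(1/(2*Df))) copied from the Model Alpha table
gvif_table = {
    "cast_votes_log": (1.612, 1, 1.270),
    "genre": (2.139, 10, 1.039),
    "critics_score": (1.365, 1, 1.168),
    "best_pic_win": (1.475, 1, 1.215),
    "cast_experience_log": (1.632, 1, 1.278),
    "runtime_log": (1.427, 1, 1.195),
    "director_experience_log": (1.160, 1, 1.077),
    "thtr_rel_month": (1.532, 11, 1.020),
    "best_pic_nom": (1.549, 1, 1.245),
}


def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)); for a 1-df term this is the square root of the VIF."""
    return gvif ** (1 / (2 * df))


recomputed = {term: adjusted_gvif(g, df) for term, (g, df, _) in gvif_table.items()}
for term, value in recomputed.items():
    print(f"{term:26s} {value:.3f}")
```

Because the adjusted value is on the square-root scale, squaring it gives a quantity that can be compared with a conventional VIF cut-off.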
Outliers
Figure 7 Model Alpha Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 22 cases exerting undue influence on the model. To discern the effect of these outliers on the model, a new model (Model Beta) was created with the outliers removed.
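Cook's distance combines the size of a residual with the leverage of the observation. The sketch below computes it for the simplest case of a single predictor, where leverage has a closed form; it is an illustration only, since the models here have many predictors. The function name, the toy data, and the 4/n cut-off are assumptions; the report does not state which cut-off was used.

```python
def cooks_distance(x, y):
    """Cook's distance for each point in a simple (one-predictor) regression."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e * e / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]


# Toy data: a perfect line except for one gross outlier at the last point
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

distances = cooks_distance(x, y)
cutoff = 4 / len(x)  # a common rule of thumb, assumed here
flagged = [i for i, d in enumerate(distances) if d > cutoff]
print(flagged)
```

A point is influential when it has both a large residual and high leverage, which is why the outlier at the edge of the predictor range dominates in this example.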
Model Beta
For this model, a forward selection procedure was undertaken based upon the full model with outliers from Model Alpha removed. The variables were added as described in Table 12.
Table 12: Model Beta forward selection process

| Step | Selected | Model.Size | DF | F.statistic | R.Squared | Adjusted.R2 | p.value | Pct Chg |
|---|---|---|---|---|---|---|---|---|
| 1 | cast_votes_log | 1 | 2 467 | 651.36 | 0.58 | 0.58 | 0 | 0.00 |
| 2 | genre | 2 | 12 457 | 67.79 | 0.62 | 0.61 | 0 | 4.98 |
| 3 | critics_score | 3 | 13 456 | 71.52 | 0.65 | 0.64 | 0 | 5.40 |
| 4 | cast_experience_log | 4 | 14 455 | 68.77 | 0.66 | 0.65 | 0 | 1.40 |
| 5 | best_pic_win | 5 | 15 454 | 66.67 | 0.67 | 0.66 | 0 | 1.53 |
| 6 | runtime_log | 6 | 16 453 | 64.20 | 0.68 | 0.67 | 0 | 1.06 |
| 7 | director_experience_log | 7 | 17 452 | 61.51 | 0.68 | 0.67 | 0 | 0.60 |
| 8 | mpaa_rating | 8 | 21 448 | 49.56 | 0.69 | 0.68 | 0 | 0.15 |
| 9 | thtr_rel_month | 9 | 32 437 | 32.50 | 0.70 | 0.68 | 0 | 0.15 |
| 10 | best_actress_win | 10 | 33 436 | 31.62 | 0.70 | 0.68 | 0 | 0.15 |
Model Overview
This model is defined as follows: \[y_i = (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7} + \beta_8 x_{i8} + \beta_9 x_{i9} + \beta_{10} x_{i10}) + \epsilon_i\]
where:
\(y_i\) is the log number of votes for movie \(i\)
\(x_{i1}\) is log of cast_votes for movie \(i\)
\(x_{i2}\) is genre of movie (action & adventure, comedy, documentary, drama, horror, mystery & suspense, other) for movie \(i\)
\(x_{i3}\) is critics score on Rotten Tomatoes for movie \(i\)
\(x_{i4}\) is log of the sum across all cast members for a film, of the number of films in which each actor appeared for movie \(i\)
\(x_{i5}\) is whether or not the movie won a best picture Oscar (no, yes) for movie \(i\)
\(x_{i6}\) is log runtime of movie (in minutes) for movie \(i\)
\(x_{i7}\) is log of the total number of films directed by the film’s director for movie \(i\)
\(x_{i8}\) is MPAA rating of the movie (G, PG, PG-13, R, Unrated) for movie \(i\)
\(x_{i9}\) is month the movie is released in theaters for movie \(i\)
\(x_{i10}\) is whether or not one of the main actresses in the movie ever won an Oscar (no, yes) for movie \(i\)
\(\epsilon_i\) is the residual for movie \(i\)
As suggested by Figure 8, the model was significant (F(32, 436) = 31.618, p < .001), with an adjusted R-squared of 0.677.
Figure 8 Model Beta Regression
Analysis of Variance
Figure 9 summarizes the analysis of variance.

| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| cast_votes_log | 1 | 1411.992 | 1411.992 | 843.213 | 0.000 | 58.24 |
| genre | 10 | 91.169 | 9.117 | 5.444 | 0.000 | 3.76 |
| critics_score | 1 | 80.036 | 80.036 | 47.796 | 0.000 | 3.30 |
| cast_experience_log | 1 | 23.489 | 23.489 | 14.027 | 0.000 | 0.97 |
| best_pic_win | 1 | 24.312 | 24.312 | 14.519 | 0.000 | 1.00 |
| runtime_log | 1 | 17.794 | 17.794 | 10.626 | 0.001 | 0.73 |
| director_experience_log | 1 | 12.500 | 12.500 | 7.464 | 0.007 | 0.52 |
| mpaa_rating | 4 | 8.385 | 2.096 | 1.252 | 0.288 | 0.35 |
| thtr_rel_month | 11 | 21.177 | 1.925 | 1.150 | 0.321 | 0.87 |
| best_actress_win | 1 | 3.390 | 3.390 | 2.025 | 0.155 | 0.14 |
| Residuals | 436 | 730.099 | 1.675 | NA | NA | 30.12 |
Figure 9 Model Beta analysis of variance
An analysis of variance was conducted on the contribution of the 10 predictors to the log number of IMDB votes. The effect of cast_votes_log was significant, F(1, 436) = 843.213, p < .001, accounting for 58.24% of the variance. The effect of genre was significant, F(10, 436) = 5.444, p < .001, accounting for 3.76% of the variance. The effect of critics_score was significant, F(1, 436) = 47.796, p < .001, accounting for 3.3% of the variance. The effect of cast_experience_log was significant, F(1, 436) = 14.027, p < .001, accounting for 0.97% of the variance. The effect of best_pic_win was significant, F(1, 436) = 14.519, p < .001, accounting for 1% of the variance. The effect of runtime_log was significant, F(1, 436) = 10.626, p < .01, accounting for 0.73% of the variance. The effect of director_experience_log was significant, F(1, 436) = 7.464, p < .01, accounting for 0.52% of the variance. The effect of mpaa_rating was not significant, F(4, 436) = 1.252, p = .288, accounting for 0.35% of the variance. The effect of thtr_rel_month was not significant, F(11, 436) = 1.15, p = .321, accounting for 0.87% of the variance. The effect of best_actress_win was not significant, F(1, 436) = 2.025, p = .155, accounting for 0.14% of the variance. Finally, the residuals accounted for approximately 30.12% of the variance.
Model Coefficients
Table 13 Model Beta Coefficients

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 8.911 | 2.043 | 4.361 | 0.000 |
| cast_votes_log | 0.609 | 0.028 | 21.379 | 0.000 |
| genreAnimation | -1.351 | 0.742 | -1.820 | 0.069 |
| genreArt House & International | -1.543 | 0.522 | -2.955 | 0.003 |
| genreComedy | -0.800 | 0.260 | -3.077 | 0.002 |
| genreDocumentary | -2.145 | 0.376 | -5.710 | 0.000 |
| genreDrama | -1.368 | 0.224 | -6.097 | 0.000 |
| genreHorror | -0.384 | 0.384 | -0.999 | 0.318 |
| genreMusical & Performing Arts | -1.807 | 0.539 | -3.351 | 0.001 |
| genreMystery & Suspense | -0.875 | 0.297 | -2.949 | 0.003 |
| genreOther | -1.427 | 0.488 | -2.922 | 0.004 |
| genreScience Fiction & Fantasy | -0.159 | 0.504 | -0.316 | 0.752 |
| critics_score | 0.014 | 0.003 | 5.329 | 0.000 |
| cast_experience_log | -0.788 | 0.193 | -4.073 | 0.000 |
| best_pic_winyes | 1.665 | 0.523 | 3.185 | 0.002 |
| runtime_log | 0.825 | 0.308 | 2.679 | 0.008 |
| director_experience_log | 0.260 | 0.107 | 2.425 | 0.016 |
| mpaa_ratingPG | -0.277 | 0.429 | -0.645 | 0.519 |
| mpaa_ratingPG-13 | -0.041 | 0.440 | -0.092 | 0.927 |
| mpaa_ratingR | -0.273 | 0.420 | -0.649 | 0.517 |
| mpaa_ratingUnrated | -0.756 | 0.511 | -1.480 | 0.140 |
| thtr_rel_monthFeb | 0.065 | 0.351 | 0.184 | 0.854 |
| thtr_rel_monthMar | 0.424 | 0.302 | 1.403 | 0.161 |
| thtr_rel_monthApr | 0.048 | 0.299 | 0.161 | 0.872 |
| thtr_rel_monthMay | 0.183 | 0.308 | 0.595 | 0.552 |
| thtr_rel_monthJun | 0.199 | 0.267 | 0.747 | 0.455 |
| thtr_rel_monthJul | 0.599 | 0.297 | 2.017 | 0.044 |
| thtr_rel_monthAug | 0.699 | 0.315 | 2.217 | 0.027 |
| thtr_rel_monthSep | -0.049 | 0.291 | -0.168 | 0.866 |
| thtr_rel_monthOct | 0.079 | 0.267 | 0.297 | 0.767 |
| thtr_rel_monthNov | 0.291 | 0.300 | 0.970 | 0.333 |
| thtr_rel_monthDec | 0.269 | 0.278 | 0.965 | 0.335 |
| best_actress_winyes | -0.281 | 0.198 | -1.423 | 0.155 |
As shown in Table 13, the predicted log of the number of IMDB votes for a film was 8.911 + 0.609 log votes for each log vote earned by a cast member. A factor is added for the various genres as follows:
Table 14 Model Beta Genres and Popularity

| Estimate | Std. Error | t value | Pr(>\|t\|) | coef |
|---|---|---|---|---|
| -0.159 | 0.504 | -0.316 | 0.752 | genreScience Fiction & Fantasy |
| -0.384 | 0.384 | -0.999 | 0.318 | genreHorror |
| -0.800 | 0.260 | -3.077 | 0.002 | genreComedy |
| -0.875 | 0.297 | -2.949 | 0.003 | genreMystery & Suspense |
| -1.351 | 0.742 | -1.820 | 0.069 | genreAnimation |
| -1.368 | 0.224 | -6.097 | 0.000 | genreDrama |
| -1.427 | 0.488 | -2.922 | 0.004 | genreOther |
| -1.543 | 0.522 | -2.955 | 0.003 | genreArt House & International |
| -1.807 | 0.539 | -3.351 | 0.001 | genreMusical & Performing Arts |
| -2.145 | 0.376 | -5.710 | 0.000 | genreDocumentary |
It should be noted that most, but not all, of the genre estimates were statistically significant.
An additional 0.014 log votes are earned for each point of the critics score, minus 0.788 log votes for each unit of the log of the sum of films in which the top 5 cast members appeared, plus 1.665 log votes if the film won best picture, plus 0.825 log votes for each unit of log runtime (in minutes), plus 0.26 log votes for each unit of the log of the number of films that the director had directed, minus 0.281 log votes if one of the main actresses in the film had ever won an Oscar. The estimate of the log number of IMDB votes is adjusted according to the month of release as follows:
Table 15 Model Beta Timing and Popularity

| Estimate | Std. Error | t value | Pr(>\|t\|) | coef |
|---|---|---|---|---|
| 0.699 | 0.315 | 2.217 | 0.027 | thtr_rel_monthAug |
| 0.599 | 0.297 | 2.017 | 0.044 | thtr_rel_monthJul |
| 0.424 | 0.302 | 1.403 | 0.161 | thtr_rel_monthMar |
| 0.291 | 0.300 | 0.970 | 0.333 | thtr_rel_monthNov |
| 0.269 | 0.278 | 0.965 | 0.335 | thtr_rel_monthDec |
| 0.199 | 0.267 | 0.747 | 0.455 | thtr_rel_monthJun |
| 0.183 | 0.308 | 0.595 | 0.552 | thtr_rel_monthMay |
| 0.079 | 0.267 | 0.297 | 0.767 | thtr_rel_monthOct |
| 0.065 | 0.351 | 0.184 | 0.854 | thtr_rel_monthFeb |
| 0.048 | 0.299 | 0.161 | 0.872 | thtr_rel_monthApr |
| -0.049 | 0.291 | -0.168 | 0.866 | thtr_rel_monthSep |
It should be noted, however, that many of the estimates for the months were not statistically significant.
Lastly, an adjustment to the estimate is made for the MPAA rating.
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 10.
Figure 10 Model beta linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(33), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 11) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 11 Model Beta homoscedasticity plot
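The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .05). Because a significant Breusch–Pagan result is evidence against constant variance, the test itself did not support homoscedasticity, and the assumption rested on the visual evidence of the residuals plot.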
Residuals
The histogram and the normal Q-Q plot in Figure 12 illustrate the distribution of residuals.
Figure 12 Model beta residuals plot
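The histogram and normal Q-Q plot suggested an approximately normal distribution of residuals. The Shapiro-Wilk test was significant (SW = 0.991, p = 0.007), which formally indicates a departure from normality; however, the skewness (-0.297) and kurtosis (2.842) were close to the values of 0 and 3 expected under normality, so the assumption of normality was considered reasonable.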
Multicollinearity
As shown in Figure 13 and Table 16, collinearity appeared to be present for this model. Variance inflation factors were computed for each predictor in the model. The maximum GVIF of 4.060 (genre) exceeded the threshold of 4. As such, the correlation among the predictors would require further consideration. The assumption of no multicollinearity was not met for this model.
Figure 13: Correlations among quantitative predictors
Table 16: Model Beta variance inflation factors

| Term | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| cast_votes_log | 1.752 | 1 | 1.324 |
| genre | 4.060 | 10 | 1.073 |
| critics_score | 1.435 | 1 | 1.198 |
| cast_experience_log | 1.707 | 1 | 1.306 |
| best_pic_win | 1.126 | 1 | 1.061 |
| runtime_log | 1.488 | 1 | 1.220 |
| director_experience_log | 1.160 | 1 | 1.077 |
| mpaa_rating | 2.835 | 4 | 1.139 |
| thtr_rel_month | 1.658 | 11 | 1.023 |
| best_actress_win | 1.205 | 1 | 1.098 |
Outliers
Figure 14 Model Beta Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 19 cases exerting undue influence on the model.
Model Gamma
For this model, a backward elimination procedure was undertaken based upon the full model. The variables were removed as described in Table 17.
Table 17: Model Gamma

| Steps | Removed | p.value |
|---|---|---|
| 1 | best_actress_win | 0.53 |
Model Overview
This model is defined as follows: \[y_i = (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7} + \beta_8 x_{i8} + \beta_9 x_{i9} + \beta_{10} x_{i10} + \beta_{11} x_{i11} + \beta_{12} x_{i12} + \beta_{13} x_{i13}) + \epsilon_i\]
where:
\(y_i\) is the log number of votes for movie \(i\)
\(x_{i1}\) is genre of movie (action & adventure, comedy, documentary, drama, horror, mystery & suspense, other) for movie \(i\)
\(x_{i2}\) is MPAA rating of the movie (G, PG, PG-13, R, Unrated) for movie \(i\)
\(x_{i3}\) is month the movie is released in theaters for movie \(i\)
\(x_{i4}\) is whether or not the movie was nominated for a best picture Oscar (no, yes) for movie \(i\)
\(x_{i5}\) is whether or not the movie won a best picture Oscar (no, yes) for movie \(i\)
\(x_{i6}\) is whether or not one of the main actors in the movie ever won an Oscar (no, yes) for movie \(i\) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
\(x_{i7}\) is whether or not the director of the movie ever won an Oscar (no, yes) for movie \(i\) – note that this is not necessarily whether the director won an Oscar for the given movie
\(x_{i8}\) is critics score on Rotten Tomatoes for movie \(i\)
\(x_{i9}\) is number of days from theatre release date to January 1, 2016 for movie \(i\)
\(x_{i10}\) is log runtime of movie (in minutes) for movie \(i\)
\(x_{i11}\) is log of cast_votes for movie \(i\)
\(x_{i12}\) is log of the total number of films directed by the film’s director for movie \(i\)
\(x_{i13}\) is log of the sum across all cast members for a film, of the number of films in which each actor appeared for movie \(i\)
\(\epsilon_i\) is the residual for movie \(i\)
As suggested by Figure 15, the model was significant (F(35, 455) = 23.956, p < .001), with an adjusted R-squared of 0.621.
Figure 15 Model Gamma Regression
Analysis of Variance
Figure 16 summarizes the analysis of variance.

| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| genre | 10 | 376.261 | 37.626 | 18.196 | 0.000 | 14.07 |
| mpaa_rating | 4 | 99.942 | 24.986 | 12.083 | 0.000 | 3.74 |
| thtr_rel_month | 11 | 103.141 | 9.376 | 4.534 | 0.000 | 3.86 |
| best_pic_nom | 1 | 111.151 | 111.151 | 53.752 | 0.000 | 4.16 |
| best_pic_win | 1 | 26.967 | 26.967 | 13.041 | 0.000 | 1.01 |
| best_actor_win | 1 | 17.770 | 17.770 | 8.594 | 0.004 | 0.66 |
| best_dir_win | 1 | 25.382 | 25.382 | 12.275 | 0.001 | 0.95 |
| critics_score | 1 | 145.037 | 145.037 | 70.139 | 0.000 | 5.42 |
| thtr_days | 1 | 207.620 | 207.620 | 100.404 | 0.000 | 7.76 |
| runtime_log | 1 | 60.489 | 60.489 | 29.252 | 0.000 | 2.26 |
| cast_votes_log | 1 | 519.590 | 519.590 | 251.271 | 0.000 | 19.43 |
| director_experience_log | 1 | 9.104 | 9.104 | 4.403 | 0.036 | 0.34 |
| cast_experience_log | 1 | 31.373 | 31.373 | 15.172 | 0.000 | 1.17 |
| Residuals | 455 | 940.870 | 2.068 | NA | NA | 35.18 |
Figure 16 Model Gamma analysis of variance
An analysis of variance was conducted on the contribution of the 13 predictors to the log number of IMDB votes. Every predictor was significant at the .05 level:
- genre: F(10, 455) = 18.196, p < .001, accounting for 14.07% of the variance
- mpaa_rating: F(4, 455) = 12.083, p < .001, 3.74% of the variance
- thtr_rel_month: F(11, 455) = 4.534, p < .001, 3.86% of the variance
- best_pic_nom: F(1, 455) = 53.752, p < .001, 4.16% of the variance
- best_pic_win: F(1, 455) = 13.041, p < .001, 1.01% of the variance
- best_actor_win: F(1, 455) = 8.594, p < .01, 0.66% of the variance
- best_dir_win: F(1, 455) = 12.275, p = .001, 0.95% of the variance
- critics_score: F(1, 455) = 70.139, p < .001, 5.42% of the variance
- thtr_days: F(1, 455) = 100.404, p < .001, 7.76% of the variance
- runtime_log: F(1, 455) = 29.252, p < .001, 2.26% of the variance
- cast_votes_log: F(1, 455) = 251.271, p < .001, 19.43% of the variance
- director_experience_log: F(1, 455) = 4.403, p < .05, 0.34% of the variance
- cast_experience_log: F(1, 455) = 15.172, p < .001, 1.17% of the variance

The residuals accounted for the remaining 35.18% of the variance.
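The overall fit statistics for Model Gamma can be cross-checked against the analysis-of-variance table. The following is a minimal Python sketch rather than the original analysis code: the helper name `summarize_anova` is hypothetical, and the inputs are the rounded table values, so agreement is only to about three decimals. The term degrees of freedom in the table sum to 35, which is the numerator used in this sketch.

```python
# Rows of the Model Gamma ANOVA table as printed (rounded): (term, Df, Sum Sq).
GAMMA_TERMS = [
    ("genre", 10, 376.261),
    ("mpaa_rating", 4, 99.942),
    ("thtr_rel_month", 11, 103.141),
    ("best_pic_nom", 1, 111.151),
    ("best_pic_win", 1, 26.967),
    ("best_actor_win", 1, 17.770),
    ("best_dir_win", 1, 25.382),
    ("critics_score", 1, 145.037),
    ("thtr_days", 1, 207.620),
    ("runtime_log", 1, 60.489),
    ("cast_votes_log", 1, 519.590),
    ("director_experience_log", 1, 9.104),
    ("cast_experience_log", 1, 31.373),
]
GAMMA_RESIDUAL = (455, 940.870)  # (Df, Sum Sq)


def summarize_anova(terms, residual):
    """Hypothetical helper: overall fit statistics from an ANOVA table."""
    resid_df, resid_ss = residual
    model_df = sum(df for _, df, _ in terms)
    model_ss = sum(ss for _, _, ss in terms)
    total_ss = model_ss + resid_ss
    n_obs = model_df + resid_df + 1  # +1 for the intercept
    r_squared = model_ss / total_ss
    adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / resid_df
    f_stat = (model_ss / model_df) / (resid_ss / resid_df)
    pct_var = {name: 100 * ss / total_ss for name, _, ss in terms}
    return {
        "model_df": model_df,
        "r_squared": r_squared,
        "adj_r_squared": adj_r_squared,
        "f_stat": f_stat,
        "pct_var": pct_var,
    }


summary = summarize_anova(GAMMA_TERMS, GAMMA_RESIDUAL)
print(summary["model_df"], round(summary["f_stat"], 3),
      round(summary["adj_r_squared"], 3))
```

Each entry of `pct_var` is the term's sum of squares divided by the total sum of squares, which is how the "% Var" column can be reproduced from the table.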
Model Coefficients
Table 18 Model Gamma Coefficients

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 10.042 | 2.284 | 4.398 | 0.000 |
| genreAnimation | -1.432 | 0.704 | -2.036 | 0.042 |
| genreArt House & International | -2.088 | 0.501 | -4.171 | 0.000 |
| genreComedy | -1.011 | 0.286 | -3.535 | 0.000 |
| genreDocumentary | -2.484 | 0.411 | -6.042 | 0.000 |
| genreDrama | -1.492 | 0.249 | -6.006 | 0.000 |
| genreHorror | -0.466 | 0.426 | -1.092 | 0.276 |
| genreMusical & Performing Arts | -1.954 | 0.545 | -3.588 | 0.000 |
| genreMystery & Suspense | -0.989 | 0.322 | -3.072 | 0.002 |
| genreOther | -1.474 | 0.471 | -3.132 | 0.002 |
| genreScience Fiction & Fantasy | -0.278 | 0.560 | -0.496 | 0.620 |
| mpaa_ratingPG | -0.129 | 0.461 | -0.280 | 0.780 |
| mpaa_ratingPG-13 | 0.135 | 0.483 | 0.279 | 0.780 |
| mpaa_ratingR | -0.154 | 0.459 | -0.335 | 0.738 |
| mpaa_ratingUnrated | -0.601 | 0.566 | -1.061 | 0.289 |
| thtr_rel_monthFeb | -0.015 | 0.371 | -0.041 | 0.967 |
| thtr_rel_monthMar | 0.387 | 0.325 | 1.190 | 0.235 |
| thtr_rel_monthApr | -0.020 | 0.329 | -0.059 | 0.953 |
| thtr_rel_monthMay | 0.019 | 0.331 | 0.058 | 0.953 |
| thtr_rel_monthJun | 0.288 | 0.293 | 0.983 | 0.326 |
| thtr_rel_monthJul | 0.725 | 0.329 | 2.204 | 0.028 |
| thtr_rel_monthAug | 0.534 | 0.346 | 1.544 | 0.123 |
| thtr_rel_monthSep | 0.075 | 0.320 | 0.235 | 0.814 |
| thtr_rel_monthOct | 0.157 | 0.299 | 0.525 | 0.600 |
| thtr_rel_monthNov | 0.388 | 0.331 | 1.172 | 0.242 |
| thtr_rel_monthDec | 0.434 | 0.307 | 1.412 | 0.159 |
| best_pic_nomyes | 0.701 | 0.436 | 1.606 | 0.109 |
| best_pic_winyes | 0.843 | 0.700 | 1.205 | 0.229 |
| best_actor_winyes | -0.194 | 0.209 | -0.931 | 0.352 |
| best_dir_winyes | 0.429 | 0.289 | 1.485 | 0.138 |
| critics_score | 0.015 | 0.003 | 5.205 | 0.000 |
| thtr_days | 0.000 | 0.000 | -0.600 | 0.549 |
| runtime_log | 0.680 | 0.348 | 1.956 | 0.051 |
| cast_votes_log | 0.569 | 0.037 | 15.535 | 0.000 |
| director_experience_log | 0.249 | 0.125 | 1.995 | 0.047 |
| cast_experience_log | -0.834 | 0.214 | -3.895 | 0.000 |
The terms in this model were the same as those in Model Beta; however, the order in which the variables entered the model differed, as did the coefficient estimates.
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 17.
Figure 17 Model Gamma linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(36), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 18) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 18 Model Gamma homoscedasticity plot
The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .01). Because the null hypothesis of the Breusch–Pagan test is constant residual variance, a significant result is evidence of some heteroscedasticity. The homoscedasticity assumption was therefore supported by the plot but not by the formal test.
Residuals
The histogram and the normal Q-Q plot in Figure 19 illustrate the distribution of residuals.
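Figure 19 Model Gamma residuals plot
The histogram and normal Q-Q plot suggested an approximately normal distribution of residuals. The Shapiro-Wilk test (W = 0.99, p = 0.001) rejected exact normality, but the skewness (-0.386) and kurtosis (3.139) were close to the values of 0 and 3 expected under a normal distribution. Taken together, the normality assumption was judged to be approximately met.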
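The skewness and kurtosis figures can be read against their normal-distribution reference values of 0 and 3. The sketch below computes both from a list of residuals using simple moment-based formulas. Whether the report used exactly this estimator is an assumption, and the function name and sample values are hypothetical.

```python
def skewness_kurtosis(values):
    """Moment-based skewness and (non-excess) kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical, perfectly symmetric residuals (not taken from the report).
toy_residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
toy_skew, toy_kurt = skewness_kurtosis(toy_residuals)
print(toy_skew, toy_kurt)
```

A symmetric sample gives a skewness of zero. A kurtosis below 3 indicates lighter tails than a normal distribution and a value above 3 indicates heavier tails.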
Multicollinearity
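As shown in Figure 20 and Table 19, some collinearity appeared to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum generalized VIF (GVIF) of 4.405, for genre, exceeded the threshold of 4, although every GVIF^(1/(2*Df)) value, which adjusts for a term's degrees of freedom, was below 1.6. As such, the correlation among the predictors would require further consideration. By the GVIF threshold, the assumption of no multicollinearity was not met for this model.
Figure 20: Correlations among quantitative predictors
Table 19 Model Gamma variance inflation factors

| Term | GVIF | Df | GVIF^(1/(2*Df)) |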
|---|---|---|---|
| genre | 4.405 | 10 | 1.077 |
| mpaa_rating | 3.173 | 4 | 1.155 |
| thtr_rel_month | 1.895 | 11 | 1.029 |
| best_pic_nom | 1.595 | 1 | 1.263 |
| best_pic_win | 1.636 | 1 | 1.279 |
| best_actor_win | 1.232 | 1 | 1.110 |
| best_dir_win | 1.381 | 1 | 1.175 |
| critics_score | 1.499 | 1 | 1.224 |
| thtr_days | 1.966 | 1 | 1.402 |
| runtime_log | 1.589 | 1 | 1.260 |
| cast_votes_log | 2.498 | 1 | 1.580 |
| director_experience_log | 1.313 | 1 | 1.146 |
| cast_experience_log | 1.755 | 1 | 1.325 |
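The last column of the table follows directly from the first two. As a minimal illustration (the function name `adjusted_gvif` is hypothetical, and only a few rows are reproduced), the adjustment raises each GVIF to the power 1/(2*Df), so that terms with many dummy variables, such as genre, can be compared with single-coefficient terms:

```python
# Selected (GVIF, Df, reported adjusted value) rows from the table above.
GAMMA_GVIF = {
    "genre": (4.405, 10, 1.077),
    "mpaa_rating": (3.173, 4, 1.155),
    "thtr_rel_month": (1.895, 11, 1.029),
    "best_pic_nom": (1.595, 1, 1.263),
    "cast_votes_log": (2.498, 1, 1.580),
}


def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)): GVIF scaled for the term's degrees of freedom."""
    return gvif ** (1.0 / (2 * df))


for term, (gvif, df, reported) in GAMMA_GVIF.items():
    # Inputs are rounded, so agreement is expected only to ~3 decimals.
    print(term, round(adjusted_gvif(gvif, df), 3), reported)
```

For a term with one degree of freedom the adjusted value is simply the square root of its VIF.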
Outliers
Figure 21 Model Gamma Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 20 cases exerting undue influence on the model. To discern the effect of the influential points on the model, a new model (Model Delta) was created without the influential points of this model.
Model Delta
For this model, a backward elimination procedure was undertaken based upon the full model. The variables were removed as described in Table 20.
Table 20: Model Delta

| Step | Removed | p-value |
|---|---|---|
| 1 | best_actress_win | 0.52 |
Model Overview
This model is defined as follows: \[y_i = (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7} + \beta_8 x_{i8} + \beta_9 x_{i9} + \beta_{10} x_{i10} + \beta_{11} x_{i11} + \beta_{12} x_{i12} + \beta_{13} x_{i13}) + \epsilon_i\]
where:
\(y_i\) and the predictors \(x_{i1}, \ldots, x_{i13}\) are as defined for Model Gamma
\(\epsilon_i\) is the residual for movie \(i\)
As suggested by Figure 22, the model was significant (F(35, 435) = 28.66, p < .001), with an adjusted R-squared of 0.673.
Figure 22 Model Delta Regression
Analysis of Variance
Figure 23 summarizes the analysis of variance.

| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| genre | 10 | 368.668 | 36.867 | 21.669 | 0.000 | 15.07 |
| mpaa_rating | 4 | 113.484 | 28.371 | 16.676 | 0.000 | 4.64 |
| thtr_rel_month | 11 | 89.408 | 8.128 | 4.777 | 0.000 | 3.65 |
| best_pic_nom | 1 | 99.674 | 99.674 | 58.586 | 0.000 | 4.07 |
| best_pic_win | 1 | 31.941 | 31.941 | 18.774 | 0.000 | 1.31 |
| best_actor_win | 1 | 21.859 | 21.859 | 12.848 | 0.000 | 0.89 |
| best_dir_win | 1 | 23.668 | 23.668 | 13.911 | 0.000 | 0.97 |
| critics_score | 1 | 122.917 | 122.917 | 72.248 | 0.000 | 5.02 |
| thtr_days | 1 | 200.765 | 200.765 | 118.005 | 0.000 | 8.21 |
| runtime_log | 1 | 72.442 | 72.442 | 42.580 | 0.000 | 2.96 |
| cast_votes_log | 1 | 516.605 | 516.605 | 303.649 | 0.000 | 21.11 |
| director_experience_log | 1 | 11.883 | 11.883 | 6.985 | 0.009 | 0.49 |
| cast_experience_log | 1 | 33.313 | 33.313 | 19.580 | 0.000 | 1.36 |
| Residuals | 435 | 740.075 | 1.701 | NA | NA | 30.25 |
Figure 23 Model Delta analysis of variance
An analysis of variance was conducted on the contribution of the 13 predictors to the log number of IMDB votes. Every predictor was significant at the .05 level:
- genre: F(10, 435) = 21.669, p < .001, accounting for 15.07% of the variance
- mpaa_rating: F(4, 435) = 16.676, p < .001, 4.64% of the variance
- thtr_rel_month: F(11, 435) = 4.777, p < .001, 3.65% of the variance
- best_pic_nom: F(1, 435) = 58.586, p < .001, 4.07% of the variance
- best_pic_win: F(1, 435) = 18.774, p < .001, 1.31% of the variance
- best_actor_win: F(1, 435) = 12.848, p < .001, 0.89% of the variance
- best_dir_win: F(1, 435) = 13.911, p < .001, 0.97% of the variance
- critics_score: F(1, 435) = 72.248, p < .001, 5.02% of the variance
- thtr_days: F(1, 435) = 118.005, p < .001, 8.21% of the variance
- runtime_log: F(1, 435) = 42.580, p < .001, 2.96% of the variance
- cast_votes_log: F(1, 435) = 303.649, p < .001, 21.11% of the variance
- director_experience_log: F(1, 435) = 6.985, p < .01, 0.49% of the variance
- cast_experience_log: F(1, 435) = 19.580, p < .001, 1.36% of the variance

The residuals accounted for the remaining 30.25% of the variance.
Model Coefficients
Table 21 Model Delta Coefficients

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9.669 | 2.120 | 4.562 | 0.000 |
| genreAnimation | -1.144 | 0.763 | -1.499 | 0.135 |
| genreArt House & International | -1.735 | 0.475 | -3.654 | 0.000 |
| genreComedy | -0.914 | 0.263 | -3.479 | 0.001 |
| genreDocumentary | -2.303 | 0.388 | -5.941 | 0.000 |
| genreDrama | -1.452 | 0.227 | -6.399 | 0.000 |
| genreHorror | -0.433 | 0.388 | -1.116 | 0.265 |
| genreMusical & Performing Arts | -1.864 | 0.546 | -3.415 | 0.001 |
| genreMystery & Suspense | -0.883 | 0.295 | -2.989 | 0.003 |
| genreOther | -1.749 | 0.472 | -3.705 | 0.000 |
| genreScience Fiction & Fantasy | -0.224 | 0.509 | -0.440 | 0.660 |
| mpaa_ratingPG | -0.042 | 0.454 | -0.092 | 0.926 |
| mpaa_ratingPG-13 | 0.231 | 0.472 | 0.490 | 0.624 |
| mpaa_ratingR | -0.033 | 0.451 | -0.073 | 0.942 |
| mpaa_ratingUnrated | -0.545 | 0.548 | -0.995 | 0.320 |
| thtr_rel_monthFeb | -0.033 | 0.352 | -0.095 | 0.925 |
| thtr_rel_monthMar | 0.458 | 0.302 | 1.515 | 0.130 |
| thtr_rel_monthApr | -0.093 | 0.302 | -0.309 | 0.758 |
| thtr_rel_monthMay | 0.217 | 0.312 | 0.695 | 0.487 |
| thtr_rel_monthJun | 0.096 | 0.270 | 0.356 | 0.722 |
| thtr_rel_monthJul | 0.586 | 0.301 | 1.951 | 0.052 |
| thtr_rel_monthAug | 0.643 | 0.322 | 1.997 | 0.046 |
| thtr_rel_monthSep | -0.045 | 0.297 | -0.152 | 0.879 |
| thtr_rel_monthOct | 0.037 | 0.274 | 0.137 | 0.891 |
| thtr_rel_monthNov | 0.166 | 0.304 | 0.547 | 0.585 |
| thtr_rel_monthDec | 0.233 | 0.282 | 0.827 | 0.409 |
| best_pic_nomyes | 0.500 | 0.412 | 1.216 | 0.225 |
| best_pic_winyes | 1.055 | 0.644 | 1.639 | 0.102 |
| best_actor_winyes | -0.103 | 0.195 | -0.530 | 0.596 |
| best_dir_winyes | 0.350 | 0.263 | 1.328 | 0.185 |
| critics_score | 0.013 | 0.003 | 4.919 | 0.000 |
| thtr_days | 0.000 | 0.000 | -0.241 | 0.809 |
| runtime_log | 0.741 | 0.322 | 2.298 | 0.022 |
| cast_votes_log | 0.593 | 0.035 | 17.057 | 0.000 |
| director_experience_log | 0.289 | 0.115 | 2.515 | 0.012 |
| cast_experience_log | -0.882 | 0.199 | -4.425 | 0.000 |
The terms in this model were the same as those in Model Gamma. The order differed, as did the coefficient estimates.
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 24.
Figure 24 Model Delta linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(36), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 25) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 25 Model Delta homoscedasticity plot
The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .05). Because the null hypothesis of the Breusch–Pagan test is constant residual variance, a significant result is evidence of some heteroscedasticity. The homoscedasticity assumption was therefore supported by the plot but not by the formal test.
Residuals
The histogram and the normal Q-Q plot in Figure 26 illustrate the distribution of residuals.
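Figure 26 Model Delta residuals plot
The histogram and normal Q-Q plot suggested an approximately normal distribution of residuals. The Shapiro-Wilk test (W = 0.992, p = 0.011) rejected exact normality at the .05 level, but the skewness (-0.281) and kurtosis (2.851) were close to the values of 0 and 3 expected under a normal distribution. Taken together, the normality assumption was judged to be approximately met.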
Multicollinearity
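As shown in Figure 27 and Table 22, some collinearity appeared to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum generalized VIF (GVIF) of 4.481, for genre, exceeded the threshold of 4, although every GVIF^(1/(2*Df)) value, which adjusts for a term's degrees of freedom, was below 1.7. As such, the correlation among the predictors would require further consideration. By the GVIF threshold, the assumption of no multicollinearity was not met for this model.
Figure 27: Correlations among quantitative predictors
Table 22 Model Delta variance inflation factors

| Term | GVIF | Df | GVIF^(1/(2*Df)) |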
|---|---|---|---|
| genre | 4.481 | 10 | 1.078 |
| mpaa_rating | 3.226 | 4 | 1.158 |
| thtr_rel_month | 1.886 | 11 | 1.029 |
| best_pic_nom | 1.631 | 1 | 1.277 |
| best_pic_win | 1.679 | 1 | 1.296 |
| best_actor_win | 1.234 | 1 | 1.111 |
| best_dir_win | 1.388 | 1 | 1.178 |
| critics_score | 1.502 | 1 | 1.226 |
| thtr_days | 2.055 | 1 | 1.433 |
| runtime_log | 1.602 | 1 | 1.266 |
| cast_votes_log | 2.599 | 1 | 1.612 |
| director_experience_log | 1.318 | 1 | 1.148 |
| cast_experience_log | 1.759 | 1 | 1.326 |
Outliers
Figure 28 Model Delta Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 19 cases exerting undue influence on the model.
Model Comparisons
To summarize, models Alpha and Beta were constructed using forward selection and models Gamma and Delta were developed via backward elimination. Models Beta and Delta were fitted without the influential data points from models Alpha and Gamma respectively.
Table 23 Summary of models

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Alpha | 9 | 29 | 462 | 29.625 | 1.439 | 1.439 | 0.642 | 0.621 | 0 | 64.227 |
| Model Beta | 10 | 33 | 436 | 31.618 | 1.294 | 1.294 | 0.699 | 0.677 | 0 | 69.885 |
| Model Gamma | 13 | 36 | 455 | 23.956 | 1.438 | 1.438 | 0.648 | 0.621 | 0 | 64.823 |
| Model Delta | 13 | 36 | 435 | 28.660 | 1.304 | 1.304 | 0.698 | 0.673 | 0 | 69.752 |
Forward Selection vs. Backward Elimination
As shown in Table 23, forward selection retained fewer predictors than backward elimination (9 and 10 versus 13). Nevertheless, the differences in root mean square error between the paired models (Alpha versus Gamma, and Beta versus Delta) were negligible (-0.08% and -0.79%). Similarly, the differences in adjusted R-squared were only -0.09% and -0.53%, and the differences in the percent of variance explained were also small (0.93% and -0.19%).
Drop or Not
The Beta and Delta models were fitted on data without the influential points identified for Alpha and Gamma, respectively. Here the differences were more substantial: RMSE differed by 11.21% and 10.25%, adjusted R-squared by 9.05% and 8.37%, and the percent of variance explained by 8.81% and 7.6%.
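The percent-of-variance comparisons in the two subsections above can be reproduced from Table 23 as relative changes. The sketch below assumes the first model of each pair is the baseline. That choice and the helper name `relative_change` are assumptions, and the RMSE and adjusted R-squared figures may have been computed from unrounded values or with a different baseline.

```python
# Percent of variance explained, taken from Table 23.
PCT_VARIANCE = {"Alpha": 64.227, "Beta": 69.885, "Gamma": 64.823, "Delta": 69.752}


def relative_change(baseline, other):
    """Percent change of `other` relative to `baseline`."""
    return 100 * (other - baseline) / baseline


comparisons = {
    # Forward selection vs. backward elimination
    "Alpha -> Gamma": relative_change(PCT_VARIANCE["Alpha"], PCT_VARIANCE["Gamma"]),
    "Beta -> Delta": relative_change(PCT_VARIANCE["Beta"], PCT_VARIANCE["Delta"]),
    # With vs. without influential points
    "Alpha -> Beta": relative_change(PCT_VARIANCE["Alpha"], PCT_VARIANCE["Beta"]),
    "Gamma -> Delta": relative_change(PCT_VARIANCE["Gamma"], PCT_VARIANCE["Delta"]),
}
for pair, change in comparisons.items():
    print(pair, round(change, 2))
```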
Prediction Accuracy
To evaluate the effects of model selection method and the treatment of outliers on prediction accuracy, the four multiple regression models were evaluated on the test data. Four measures of prediction accuracy were used:
- MAPE - Mean Absolute Percentage Error
- MPE - Mean Percentage Error
- MSE - Mean Squared Error
- RMSE - Root Mean Squared Error
In addition, a percent accuracy measure was computed as the percentage of the observations in the test set in which the actual log number of IMDB votes fell within the prediction interval.
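The measures above can be sketched as follows. This is an illustrative implementation, not the code used for the report: the function names and toy values are hypothetical, and the percentage errors are taken as (actual - predicted) / actual, a sign convention under which a negative MPE corresponds to over-prediction.

```python
import math


def accuracy_measures(actual, predicted):
    """MAPE, MPE, MSE and RMSE for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    # Percentage errors require non-zero actual values.
    pct_errors = [100 * e / a for e, a in zip(errors, actual)]
    mse = sum(e * e for e in errors) / n
    return {
        "MAPE": sum(abs(p) for p in pct_errors) / n,
        "MPE": sum(pct_errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


def interval_coverage(actual, lower, upper):
    """Percent of actual values inside their prediction interval."""
    hits = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return 100 * hits / len(actual)


# Hypothetical log-vote values, not taken from the test set.
toy_actual = [10.0, 8.0, 12.0]
toy_predicted = [11.0, 8.0, 10.0]
toy_measures = accuracy_measures(toy_actual, toy_predicted)
toy_coverage = interval_coverage(toy_actual, [9.5, 7.0, 10.5], [12.0, 9.0, 11.5])
print(toy_measures, toy_coverage)
```

In the toy data the third observation falls outside its interval, so two of the three observations are covered.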
Table 24 Model Predictive Accuracy Summary

| Model | Size | F Statistic | R-Squared | Adj R-Squared | % Variance | MAPE | MPE | MSE | RMSE | % Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Alpha | 9 | 29.625 | 0.642 | 0.621 | 64.227 | 8.415 | -1.165 | 2.069 | 1.438 | 94.355 |
| Model Beta | 10 | 31.618 | 0.699 | 0.677 | 69.885 | 8.372 | -1.578 | 2.156 | 1.468 | 93.548 |
| Model Gamma | 13 | 23.956 | 0.648 | 0.621 | 64.823 | 8.373 | -0.894 | 2.072 | 1.439 | 96.774 |
| Model Delta | 13 | 28.660 | 0.698 | 0.673 | 69.752 | 8.444 | -1.374 | 2.124 | 1.457 | 93.548 |
There were no substantial differences in MAPE, MSE, and RMSE among the models. The negative MPE values indicate that all models tended to over-predict. In terms of percent accuracy, the two models fitted without the influential points (Beta and Delta) performed identically (93.548%). The Alpha and Gamma models performed comparably well on the test data; however, the Alpha model did so with 4 fewer variables. Therefore, the most parsimonious model, Alpha, would advance to the movie prediction stage.
The Model
The prediction equation was defined as follows: \[y_i = 9.932304 + 0.589 x_1 - 1.422 x_2 - 2.225 x_3 - 0.976 x_4 - 2.633 x_5 - 1.508 x_6 - 0.583 x_7 - 1.929 x_8 - 1.039 x_9 - 1.564 x_{10} - 0.231 x_{11} + 0.014 x_{12} + 1.128 x_{13} - 0.841 x_{14} + 0.676 x_{15} + 0.289 x_{16} - 0.038 x_{17}\]
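Evaluating the prediction equation amounts to a dot product plus the intercept. The sketch below uses the coefficients exactly as printed. The function name is hypothetical, and because the mapping of \(x_1\) through \(x_{17}\) to specific predictors is not restated here, the input vectors are purely illustrative.

```python
INTERCEPT = 9.932304
# Coefficients for x_1 ... x_17, in the order given in the equation.
COEFFICIENTS = [
    0.589, -1.422, -2.225, -0.976, -2.633, -1.508, -0.583, -1.929, -1.039,
    -1.564, -0.231, 0.014, 1.128, -0.841, 0.676, 0.289, -0.038,
]


def predict_log_votes(x):
    """Predicted log number of IMDB votes for one vector of 17 inputs."""
    if len(x) != len(COEFFICIENTS):
        raise ValueError(f"expected {len(COEFFICIENTS)} inputs, got {len(x)}")
    return INTERCEPT + sum(b * v for b, v in zip(COEFFICIENTS, x))


# Illustrative inputs only: all zeros, then x_1 switched to 1.
baseline_input = [0.0] * 17
baseline_prediction = predict_log_votes(baseline_input)
shifted_prediction = predict_log_votes([1.0] + [0.0] * 16)
print(baseline_prediction, shifted_prediction)
```

With every input at zero the prediction equals the intercept, and raising \(x_1\) by one unit raises the predicted log votes by 0.589. If the response is a natural log, which is an assumption here, exponentiating the prediction converts it back to a vote count.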